The airline industry is intensely competitive, which makes it essential to know one's customers, their interests, and their preferences. This work analyzes survey data from passengers of a given airline to evaluate different aspects of the service and how they affect satisfaction, using Data Science techniques.
Model objectives
The general objective of this work is to train a Machine Learning model that predicts satisfaction as accurately as possible under different contexts, and to analyze which variables correlate most strongly with satisfaction.
The specific objectives are:
* Characterize customers' features and preferences by gender, age, customer type, and flight class.
* Identify which services need improvement, and whether they are associated with general customer characteristics.
* Analyze the general characteristics of the flights with the most problems (e.g., departure/arrival delays) and assess whether the collected information is useful for proposing solutions.
* Develop a predictive model that identifies passenger satisfaction with the services provided.
Data description
The data were obtained from the repository www.kaggle.com. It is a structured dataset built from surveys of more than 100k customers. It contains fields describing each customer's general characteristics, such as gender, age, type of travel, passenger category, and flight distance, as well as the customer's opinions on different aspects of the trip. For the latter, passengers rated each aspect on a Likert scale, where 0 means the question does not apply (no answer) and scores from 1 to 5 indicate the level of satisfaction.
Gender: passenger gender (categorical, "Female / Male")
Customer Type: type of customer (categorical, "Loyal Customer / disloyal Customer")
Age: passenger age (numeric, in years)
Type of Travel: purpose of the trip (categorical: Personal Travel / Business travel)
Class: travel class (categorical: Business / Eco / Eco Plus)
Flight distance: distance flown (numeric, in kilometers)
Inflight wifi service: satisfaction with the in-flight wifi service (Likert scale)
Departure/Arrival time convenient: satisfaction with the convenience of departure/arrival times (Likert scale)
Ease of Online booking: satisfaction with online booking (Likert scale)
Gate location: satisfaction with the boarding gate location at the airport (Likert scale)
Food and drink: satisfaction with food and drink (Likert scale)
Online boarding: satisfaction with online boarding (Likert scale)
Seat comfort: satisfaction with seat comfort (Likert scale)
Inflight entertainment: satisfaction with in-flight entertainment (Likert scale)
On-board service: satisfaction with on-board service (Likert scale)
Leg room service: satisfaction with leg room (Likert scale)
Baggage handling: satisfaction with baggage handling (Likert scale)
Check-in service: satisfaction with the check-in service (Likert scale)
Inflight service: satisfaction with in-flight service (Likert scale)
Cleanliness: satisfaction with cleanliness (Likert scale)
Departure Delay in Minutes: departure delay (numeric, in minutes)
Arrival Delay in Minutes: arrival delay (numeric, in minutes)
Satisfaction: overall satisfaction with the airline, recorded as "satisfied" or "neutral or dissatisfied".
Data Wrangling and EDA
Packages: NumPy, Pandas, Matplotlib, Seaborn, Plotly, and Sklearn
Code
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import io
import requests

url_train = "https://github.com/jonezequiel92/Airline-Passenger-Satisfaction/raw/main/train.csv"
url_test = "https://github.com/jonezequiel92/Airline-Passenger-Satisfaction/raw/main/test.csv"
s_train = requests.get(url_train).content
s_test = requests.get(url_test).content
df_train = pd.read_csv(io.StringIO(s_train.decode('utf-8')))
df_test = pd.read_csv(io.StringIO(s_test.decode('utf-8')))
# The following steps operate on a single frame `df`; presumably the two
# partitions were concatenated (the df.info() below shows 129880 rows,
# i.e. train + test together)
df = pd.concat([df_train, df_test], ignore_index=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129880 non-null object
1 Customer Type 129880 non-null object
2 Age 129880 non-null int64
3 Type of Travel 129880 non-null object
4 Class 129880 non-null object
5 Flight Distance 129880 non-null int64
6 Inflight wifi service 129880 non-null int64
7 Departure/Arrival time convenient 129880 non-null int64
8 Ease of Online booking 129880 non-null int64
9 Gate location 129880 non-null int64
10 Food and drink 129880 non-null int64
11 Online boarding 129880 non-null int64
12 Seat comfort 129880 non-null int64
13 Inflight entertainment 129880 non-null int64
14 On-board service 129880 non-null int64
15 Leg room service 129880 non-null int64
16 Baggage handling 129880 non-null int64
17 Checkin service 129880 non-null int64
18 Inflight service 129880 non-null int64
19 Cleanliness 129880 non-null int64
20 Departure Delay in Minutes 129880 non-null int64
21 Arrival Delay in Minutes 129487 non-null float64
22 satisfaction 129880 non-null object
dtypes: float64(1), int64(17), object(5)
memory usage: 22.8+ MB
393 null values in Arrival Delay in Minutes
Code
df.isnull().sum()
Gender 0
Customer Type 0
Age 0
Type of Travel 0
Class 0
Flight Distance 0
Inflight wifi service 0
Departure/Arrival time convenient 0
Ease of Online booking 0
Gate location 0
Food and drink 0
Online boarding 0
Seat comfort 0
Inflight entertainment 0
On-board service 0
Leg room service 0
Baggage handling 0
Checkin service 0
Inflight service 0
Cleanliness 0
Departure Delay in Minutes 0
Arrival Delay in Minutes 393
satisfaction 0
dtype: int64
Since there are only 393 null values out of 129880 records, those rows are dropped.
The data type of Arrival Delay in Minutes is also changed from float to integer.
Code
# drop the 393 rows with nulls before casting (astype fails on NaN)
df = df.dropna(subset=['Arrival Delay in Minutes']).reset_index(drop=True)
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].astype('int64')
We verify that no missing values remain.
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129487 entries, 0 to 129486
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129487 non-null object
1 Customer Type 129487 non-null object
2 Age 129487 non-null int64
3 Type of Travel 129487 non-null object
4 Class 129487 non-null object
5 Flight Distance 129487 non-null int64
6 Inflight wifi service 129487 non-null int64
7 Departure/Arrival time convenient 129487 non-null int64
8 Ease of Online booking 129487 non-null int64
9 Gate location 129487 non-null int64
10 Food and drink 129487 non-null int64
11 Online boarding 129487 non-null int64
12 Seat comfort 129487 non-null int64
13 Inflight entertainment 129487 non-null int64
14 On-board service 129487 non-null int64
15 Leg room service 129487 non-null int64
16 Baggage handling 129487 non-null int64
17 Checkin service 129487 non-null int64
18 Inflight service 129487 non-null int64
19 Cleanliness 129487 non-null int64
20 Departure Delay in Minutes 129487 non-null int64
21 Arrival Delay in Minutes 129487 non-null int64
22 satisfaction 129487 non-null object
dtypes: int64(18), object(5)
memory usage: 22.7+ MB
Categorical variables
Code
df.dtypes[df.dtypes =='object']
Gender object
Customer Type object
Type of Travel object
Class object
satisfaction object
dtype: object
Unique values of the categorical variables
Code
for i in df.dtypes[df.dtypes == 'object'].index:
    print(i)
    print(df[i].unique())
Gender
['Male' 'Female']
Customer Type
['Loyal Customer' 'disloyal Customer']
Type of Travel
['Personal Travel' 'Business travel']
Class
['Eco Plus' 'Business' 'Eco']
satisfaction
['neutral or dissatisfied' 'satisfied']
Encode the categorical variables numerically (note: this is an integer mapping rather than one-hot dummies)
Code
df['transformed_Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['transformed_Customer Type'] = df['Customer Type'].map({'Loyal Customer': 1, 'disloyal Customer': 0})
df['transformed_Type of Travel'] = df['Type of Travel'].map({'Business travel': 1, 'Personal Travel': 0})
df['transformed_Class'] = df['Class'].map({'Business': 2, 'Eco Plus': 1, 'Eco': 0})
df['transformed_satisfaction'] = df['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})
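The mapping above encodes Class ordinally (Eco < Eco Plus < Business). Where no ordering should be implied, true dummy variables can be built with `pd.get_dummies` instead; a minimal sketch on a hypothetical toy frame (not the survey data):

```python
import pandas as pd

# Toy frame with hypothetical rows mirroring the dataset's categories
toy = pd.DataFrame({
    'Class': ['Eco Plus', 'Business', 'Eco'],
    'Gender': ['Male', 'Female', 'Male'],
})

# One 0/1 column per category, with no ordering implied between classes
dummies = pd.get_dummies(toy, columns=['Class', 'Gender'], prefix='transformed')
print(sorted(dummies.columns))
```

For tree-based models the ordinal mapping is usually harmless; for linear models the dummy encoding avoids injecting an artificial order.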
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129487 entries, 0 to 129486
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 129487 non-null object
1 Customer Type 129487 non-null object
2 Age 129487 non-null int64
3 Type of Travel 129487 non-null object
4 Class 129487 non-null object
5 Flight Distance 129487 non-null int64
6 Inflight wifi service 129487 non-null int64
7 Departure/Arrival time convenient 129487 non-null int64
8 Ease of Online booking 129487 non-null int64
9 Gate location 129487 non-null int64
10 Food and drink 129487 non-null int64
11 Online boarding 129487 non-null int64
12 Seat comfort 129487 non-null int64
13 Inflight entertainment 129487 non-null int64
14 On-board service 129487 non-null int64
15 Leg room service 129487 non-null int64
16 Baggage handling 129487 non-null int64
17 Checkin service 129487 non-null int64
18 Inflight service 129487 non-null int64
19 Cleanliness 129487 non-null int64
20 Departure Delay in Minutes 129487 non-null int64
21 Arrival Delay in Minutes 129487 non-null int64
22 satisfaction 129487 non-null object
23 transformed_Gender 129487 non-null int64
24 transformed_Customer Type 129487 non-null int64
25 transformed_Type of Travel 129487 non-null int64
26 transformed_Class 129487 non-null int64
27 transformed_satisfaction 129487 non-null int64
dtypes: int64(23), object(5)
memory usage: 27.7+ MB
Metrics, Functions, and Plots
Code
df.head()
| | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | transformed_Gender | transformed_Customer Type | transformed_Type of Travel | transformed_Class | transformed_satisfaction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied | 1 | 1 | 0 | 1 | 0 |
| 1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied | 1 | 0 | 1 | 2 | 0 |
| 2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied | 0 | 1 | 1 | 2 | 1 |
| 3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied | 0 | 1 | 1 | 2 | 0 |
| 4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied | 1 | 1 | 1 | 2 | 1 |
Code
df.shape
(129487, 28)
Code
df.describe().T[:-5]
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age | 129487.000 | 39.429 | 15.118 | 7.000 | 27.000 | 40.000 | 51.000 | 85.000 |
| Flight Distance | 129487.000 | 1190.211 | 997.561 | 31.000 | 414.000 | 844.000 | 1744.000 | 4983.000 |
| Inflight wifi service | 129487.000 | 2.729 | 1.329 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure/Arrival time convenient | 129487.000 | 3.057 | 1.527 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease of Online booking | 129487.000 | 2.757 | 1.402 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate location | 129487.000 | 2.977 | 1.279 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food and drink | 129487.000 | 3.205 | 1.330 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online boarding | 129487.000 | 3.253 | 1.351 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Seat comfort | 129487.000 | 3.442 | 1.319 | 0.000 | 2.000 | 4.000 | 5.000 | 5.000 |
| Inflight entertainment | 129487.000 | 3.358 | 1.334 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| On-board service | 129487.000 | 3.383 | 1.287 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Leg room service | 129487.000 | 3.351 | 1.316 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Baggage handling | 129487.000 | 3.632 | 1.180 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin service | 129487.000 | 3.306 | 1.266 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Inflight service | 129487.000 | 3.642 | 1.177 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Cleanliness | 129487.000 | 3.286 | 1.314 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure Delay in Minutes | 129487.000 | 14.643 | 37.933 | 0.000 | 0.000 | 0.000 | 12.000 | 1592.000 |
| Arrival Delay in Minutes | 129487.000 | 15.091 | 38.466 | 0.000 | 0.000 | 0.000 | 13.000 | 1584.000 |
Plots
Satisfaction plot
Code
plt.figure(figsize=(6,6))
labels = 'Neutral or dissatisfied', 'Satisfied'
explode = (0, 0.1)
df_group = df.satisfaction.value_counts(normalize=True).mul(100)
df_group.plot.pie(autopct="%.2f", cmap='tab10', labels=labels, explode=explode, shadow=True).set(
    title='% de clientes según grado de satisfacción')
plt.show()
Univariate plot
Code
plt.figure(figsize=(6,6))
v, m, g = plt.hist(df['Age'], color='lightblue')
plt.title("Distribución de la edad", size=18)
plt.ylabel("Frecuencia", size=14)
for i, rect in enumerate(g):
    posx = rect.get_x()
    posy = rect.get_height()
    plt.text(posx + 0.5, posy + 30, int(v[i]), color='black', fontsize=8, weight='bold')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()
Plot of the proportion of passengers by travel class
Code
class_perc = df['Class'].value_counts(normalize=True).mul(100)
fig, ax = plt.subplots(figsize=(6,6))
ax.pie(class_perc, labels=class_perc.index, autopct='%1.1f%%', startangle=90)
ax.axis('equal')
plt.title("Proporción según clase del viaje", size=12)
plt.show()
Seat comfort
Code
from seaborn import countplot
plt.figure(figsize=(6,6))
# pass the column as a keyword argument (positional use is deprecated in seaborn 0.12)
ax = countplot(x='Seat comfort', data=df)
plt.xticks(size=12)
plt.yticks(size=12)
plt.ylabel('Cantidad')
ax.set(xlabel=None)
abs_values = df['Seat comfort'].value_counts(ascending=False).values
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.1, p.get_height() + 120), weight='bold')
plt.title("Pasajeros según reseña de satisfacción sobre el asiento en vuelo", size=12)
plt.grid(color='r', linestyle='dotted', linewidth=1, axis='y')
plt.show()
Code
list_poll = ['Inflight wifi service', 'Departure/Arrival time convenient', 'Ease of Online booking',
             'Food and drink', 'Online boarding', 'Cleanliness']
colores = ["#EE6055", "#60D394", "#AAF683", "#FFD97D", "#FF9B85", "#FFFFFF"]
fig, axarr = plt.subplots(2, 3, figsize=(16, 10))
for index, i in enumerate(list_poll):
    df[i].value_counts().plot.pie(autopct="%.1f", colors=colores, ax=axarr[int(index / 3)][index % 3])
sns.set(font_scale=1.2)
plt.title('Proporciones de las encuestas de satisfacción', fontsize=14, fontweight=20)
plt.show()
Relationship between class and age
Code
fig, ax = plt.subplots(figsize=(6,6))
ax = sns.boxplot(x=df['Class'], y=df['Age'])
ax.set_title('Relación entre Clase y Edad', {'fontsize': 14}, pad=20)
ax.set(xlabel=None)
ax.set_ylabel('Edad')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()
Relationship between flight distance and satisfaction
Code
plt.figure(figsize=(6,6))
sns.violinplot(x='satisfaction', y='Flight Distance', data=df, palette='colorblind')
plt.title('Relación entre Distancia del viaje y Satisfacción', fontsize=14, fontweight=30)
plt.ylabel('Distancia del viaje')
plt.xlabel('')
plt.grid(color='r', linestyle='dotted', linewidth=1)
plt.show()
Code
pd.crosstab(df.Class, df.satisfaction)
| Class | neutral or dissatisfied | satisfied |
| --- | --- | --- |
| Business | 18940 | 43050 |
| Eco | 47215 | 10902 |
| Eco Plus | 7070 | 2310 |
Code
pd.crosstab(df['Type of Travel'], df.satisfaction)
| Type of Travel | neutral or dissatisfied | satisfied |
| --- | --- | --- |
| Business travel | 37238 | 52207 |
| Personal Travel | 35987 | 4055 |
Code
df.groupby('Type of Travel')['Flight Distance'].describe()
| Type of Travel | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Business travel | 71655.00 | 1368.29 | 1086.68 | 31.00 | 451.00 | 986.00 | 2143.00 | 4983.00 |
| Personal Travel | 32249.00 | 792.08 | 592.27 | 31.00 | 363.00 | 628.00 | 1023.00 | 4983.00 |
Multivariate plot
Code
sns.set(font_scale=1)
plt.figure(figsize=(6,6))
sns.violinplot(x="satisfaction", y="Age", hue="Gender", data=df,
               palette=['#008B8B', '#00FFFF'], split=True, scale="count")
plt.title('Relación Género y Edad, según satisfacción', fontsize=14, fontweight=20)
plt.xlabel('')
plt.ylabel('Edad')
plt.grid(color='r', linestyle='dotted', linewidth=1, axis='y')
plt.show()
Relationship between delays and satisfaction
Code
plt.figure(figsize=(6,6))
sns.scatterplot(data=df, x='Arrival Delay in Minutes', y='Departure Delay in Minutes',
                hue='satisfaction', palette='gist_rainbow_r', alpha=0.8)
plt.grid()  # add a grid
plt.show()
Code
# the deprecated `size` parameter is dropped in favor of `height`
sns.FacetGrid(df, hue='Class', height=1.5, aspect=2).map(plt.scatter, 'Flight Distance', 'Age').add_legend()
sns.set(font_scale=2)
plt.title('Relacion Edad vs Distancia segun clase', fontsize=20, fontweight=30)
plt.show()
Satisfaction by customer type
Code
fig, ax = plt.subplots(figsize=(6,6))
ax = sns.countplot(x='Customer Type', palette="Set2", data=df, hue='satisfaction')
ax.set_title('Relación entre Tipo de cliente y Satisfacción', {'fontsize': 14}, pad=20)
ax.set(ylabel=None)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.1, p.get_height() + 120), weight='bold')
plt.grid(color='r', linestyle='dotted', linewidth=1, axis='y')
plt.show()
Service indicators by class and satisfaction
Code
# Checkin service, Inflight service, On-board service, Leg room service
list_service_color = [['Checkin service', 'Oranges'], ['Inflight service', 'Blues'],
                      ['On-board service', 'pink'], ['Leg room service', 'bone']]  # (indicator, colormap) pairs to iterate over for the heatmaps
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))
for index, i in enumerate(list_service_color):
    servicio, color = i
    sns.heatmap(pd.crosstab(df['satisfaction'], df[servicio]), cmap=color, ax=axarr[int(index / 2)][index % 2])
Data Preprocessing
Code
df.head()
| | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | transformed_Gender | transformed_Customer Type | transformed_Type of Travel | transformed_Class | transformed_satisfaction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied | 1 | 1 | 0 | 1 | 0 |
| 1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied | 1 | 0 | 1 | 2 | 0 |
| 2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied | 0 | 1 | 1 | 2 | 1 |
| 3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied | 0 | 1 | 1 | 2 | 0 |
| 4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied | 1 | 1 | 1 | 2 | 1 |
The original (untransformed) categorical columns are dropped.
Code
columnas_a_eliminar = ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']
df.drop(columns=columnas_a_eliminar, inplace=True)
Records with a value of 0 (no answer / not applicable) in the satisfaction indicators are removed.
Code
columnas_indicadores = ['Inflight wifi service', 'Departure/Arrival time convenient',
                        'Ease of Online booking', 'Gate location', 'Food and drink',
                        'Online boarding', 'Seat comfort', 'Inflight entertainment',
                        'On-board service', 'Leg room service', 'Baggage handling',
                        'Checkin service', 'Inflight service', 'Cleanliness']
We count the records with 0 (no answer / not applicable) in each satisfaction indicator.
Code
for i in columnas_indicadores:
    print(i)
    print(len(df[df[i] == 0]))
Inflight wifi service
3908
Departure/Arrival time convenient
6664
Ease of Online booking
5666
Gate location
1
Food and drink
130
Online boarding
3071
Seat comfort
1
Inflight entertainment
18
On-board service
5
Leg room service
596
Baggage handling
0
Checkin service
1
Inflight service
5
Cleanliness
14
We count how often (as a percentage) each value appears in each satisfaction indicator.
Code
for i in columnas_indicadores:
    print(f'{i} (%)')
    valores = round(df[i].value_counts(normalize=True).mul(100).sort_values()).astype('int')
    print(valores.to_string())
We remove the records with 0 (no answer / not applicable) in the satisfaction indicators.
Code
print(f'Tamaño del dataframe original: {df.shape}')
for i in columnas_indicadores:
    df = df[df[i] != 0]
print(f'Tamaño del dataframe: {df.shape}')
Tamaño del dataframe original: (129487, 23)
Tamaño del dataframe: (119204, 23)
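The loop above filters the frame one indicator at a time; the same result can be obtained in a single vectorized expression, which also makes the intent explicit. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with two Likert indicators; the middle row has a 0 ("no answer")
toy = pd.DataFrame({
    'Inflight wifi service': [3, 0, 5],
    'Online boarding': [4, 2, 1],
})
indicadores = ['Inflight wifi service', 'Online boarding']

# Keep only rows where every indicator is non-zero, in one vectorized step
filtrado = toy[(toy[indicadores] != 0).all(axis=1)]
print(filtrado.shape[0])  # 2
```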
We verify that no records with 0 (no answer / not applicable) remain in any satisfaction indicator.
Code
for i in columnas_indicadores:
    print(i)
    print(len(df[df[i] == 0]))
Inflight wifi service
0
Departure/Arrival time convenient
0
Ease of Online booking
0
Gate location
0
Food and drink
0
Online boarding
0
Seat comfort
0
Inflight entertainment
0
On-board service
0
Leg room service
0
Baggage handling
0
Checkin service
0
Inflight service
0
Cleanliness
0
Code
from sklearn.model_selection import train_test_split

# X_normalized (scaled feature matrix) and y (the transformed_satisfaction
# column) are assumed to be defined in an upstream cell not shown here.
# Keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)
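Since the two satisfaction classes are not perfectly balanced, passing `stratify` to `train_test_split` keeps the class ratio identical in both partitions. A sketch with a hypothetical toy target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify preserves the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)
print(y_tr.mean(), y_te.mean())  # both 0.2
```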
Correlations
Code
corr = df.corr(method='spearman')
# Mask to hide the repeated (upper-triangle) values
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(16, 16))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, annot=True, mask=mask, cmap="YlGnBu", center=0, square=True, linewidths=.5, fmt='.2f')
Code
plt.figure(figsize=(6,6))
df.corr().iloc[:-1, -1].sort_values().plot(kind='barh', color='g')
plt.title('Correlación con variable target: Satisfaction', size=14)
plt.show()
According to the correlation plot, the variables that best explain customer satisfaction are:
Online Boarding
Class
Type of Travel
while those with the weakest correlation with satisfaction are:
Gate Location
Gender
Departure/Arrival Time Convenient
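The ranking read off the bar chart can also be computed directly by sorting the absolute Spearman correlations against the target. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame (hypothetical values): two features and a 0/1 target
toy = pd.DataFrame({
    'Online boarding': [1, 2, 4, 5, 5],
    'Gate location': [3, 1, 3, 2, 3],
    'target': [0, 0, 1, 1, 1],
})

# Absolute Spearman correlation of each feature with the target, strongest first
ranking = (toy.corr(method='spearman')['target']
              .drop('target').abs().sort_values(ascending=False))
print(ranking.index[0])  # Online boarding
```

Taking absolute values matters here: a strongly negative correlation is just as informative for the model as a strongly positive one.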
Model evaluation
Decision Tree
Code
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score, classification_report,
                             roc_curve, precision_recall_curve,
                             ConfusionMatrixDisplay, RocCurveDisplay)

# baseline model with default settings
model = DecisionTreeClassifier(random_state=42, class_weight='balanced')
# defaults:
# 'criterion': 'gini'
# 'max_depth': None
# 'min_samples_leaf': 1
# 'min_samples_split': 2

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# prediction
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_dt = time.time() - t0
accuracy_dt = accuracy_score(y_test, y_pred)
roc_auc_dt = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_dt))
print("ROC Area bajo la Curva = {}".format(roc_auc_dt))
print("Tiempo de Ejecución = {}".format(time_taken_dt))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_dt, tpr_dt, thresholds = roc_curve(y_test, y_probs)
precision_dt, recall_dt, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9484089256752978
ROC Area bajo la Curva = 0.9475519760897532
Tiempo de Ejecución = 0.6076035499572754
precision recall f1-score support
0 0.95692 0.95324 0.95508 20574
1 0.93699 0.94186 0.93942 15188
accuracy 0.94841 35762
macro avg 0.94695 0.94755 0.94725 35762
weighted avg 0.94845 0.94841 0.94843 35762
(GridSearchCV output, abridged) The hyperparameter search raised a FitFailedWarning: 456 of 1824 fits failed with `ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1`, so the test scores for those parameter combinations were set to nan. The remaining cross-validation accuracies range from ~0.796 for the shallowest trees up to ~0.949 for the best combinations.
We see that beyond a max_depth of 10 the model's accuracy improves only marginally, and past 15/16 performance starts to degrade.
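The grid-search cell itself is not shown; the sketch below reconstructs a plausible version on toy data (the grid values are assumptions, not the original settings), with the `min_samples_split` value that triggered the FitFailedWarning corrected to valid values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the survey features (the real search ran on X_train)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Presumed grid; min_samples_split must be >= 2 — passing 1 is what caused
# the FitFailedWarning in the original run
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 8, 16],
    'min_samples_split': [2, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    param_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Setting `error_score='raise'` instead of the default would surface such parameter errors immediately rather than silently producing nan scores.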
We choose the estimator based on the GridSearch results.
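As a minimal, self-contained sketch of the kind of scan GridSearchCV performs here (on a small synthetic dataset rather than the survey data, so names like `X_demo` are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the survey features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# try every combination of the listed hyperparameter values
param_grid = {'max_depth': [5, 10, 15, 16, 20], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight='balanced', random_state=42),
    param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found
print(search.best_score_)    # its mean cross-validated accuracy
```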
Code
model = DecisionTreeClassifier(class_weight='balanced', criterion='entropy', max_depth=16, random_state=42)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_dt_best = time.time() - t0
accuracy_dt_best = accuracy_score(y_test, y_pred)
roc_auc_dt_best = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_dt_best))
print("ROC Area bajo la Curva = {}".format(roc_auc_dt_best))
print("Tiempo de Ejecución = {}".format(time_taken_dt_best))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_dt_best, tpr_dt_best, thresholds = roc_curve(y_test, y_probs)
precision_dt_best, recall_dt_best, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9531066495162462
ROC Area bajo la Curva = 0.977873386455552
Tiempo de Ejecución = 0.7934095859527588
precision recall f1-score support
0 0.95989 0.95854 0.95922 20574
1 0.94394 0.94575 0.94484 15188
accuracy 0.95311 35762
macro avg 0.95192 0.95214 0.95203 35762
weighted avg 0.95312 0.95311 0.95311 35762
Code
# feature importance
plt.figure(figsize=(6, 6))
pd.Series(model.feature_importances_, index=X.columns).sort_values().plot(kind='barh', color='g')
plt.title('Decision Tree - Feature Importance')
plt.show()
Random Forest
Code
# Random Forest (ensemble method - bagging)
# create base model with default settings
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# defaults:
# max_depth: None
# criterion: gini
# n_estimators: 100

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_rf = time.time() - t0
accuracy_rf = accuracy_score(y_test, y_pred)
roc_auc_rf = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_rf))
print("ROC Area bajo la Curva = {}".format(roc_auc_rf))
print("Tiempo de Ejecución = {}".format(time_taken_rf))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_probs)
precision_rf, recall_rf, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9629494994687098
ROC Area bajo la Curva = 0.9939333439990472
Tiempo de Ejecución = 10.66377305984497
precision recall f1-score support
0 0.95664 0.98002 0.96819 20574
1 0.97201 0.93982 0.95565 15188
accuracy 0.96295 35762
macro avg 0.96432 0.95992 0.96192 35762
weighted avg 0.96317 0.96295 0.96286 35762
We can see that beyond a max_depth of 15 the model's accuracy improves only marginally, and performance remains stable beyond 20.
We choose the estimator based on the RandomizedSearch results.
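Unlike GridSearchCV, RandomizedSearchCV samples a fixed number of candidate combinations from distributions instead of trying them all. A minimal sketch of that idea, on synthetic stand-in data (names like `X_demo` and the parameter ranges are illustrative, not the values used for the survey):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# sample max_depth from a range instead of enumerating every value
param_dist = {'max_depth': randint(5, 30), 'criterion': ['gini', 'entropy']}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight='balanced', n_estimators=50, random_state=42),
    param_dist, n_iter=5, cv=3, scoring='accuracy', random_state=42)
search.fit(X_demo, y_demo)

print(search.best_params_)
```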
Code
model = RandomForestClassifier(class_weight='balanced', random_state=42, criterion='entropy', max_depth=20)
# accuracy increases with respect to the first model, but only minimally

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_rf_best = time.time() - t0
accuracy_rf_best = accuracy_score(y_test, y_pred)
roc_auc_rf_best = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_rf_best))
print("ROC Area bajo la Curva = {}".format(roc_auc_rf_best))
print("Tiempo de Ejecución = {}".format(time_taken_rf_best))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_rf_best, tpr_rf_best, thresholds = roc_curve(y_test, y_probs)
precision_rf_best, recall_rf_best, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9629774621106202
ROC Area bajo la Curva = 0.9942483934672476
Tiempo de Ejecución = 11.95407748222351
precision recall f1-score support
0 0.95790 0.97866 0.96817 20574
1 0.97022 0.94173 0.95576 15188
accuracy 0.96298 35762
macro avg 0.96406 0.96020 0.96197 35762
weighted avg 0.96313 0.96298 0.96290 35762
Code
# feature importance
plt.figure(figsize=(6, 6))
pd.Series(model.feature_importances_, index=X.columns).sort_values().plot(kind='barh', color='g')
plt.title('Random Forest - Feature Importance')
plt.show()
Logistic Regression
Code
# create base model with default settings
model = LogisticRegression(random_state=42, class_weight='balanced')
# defaults:
# 'l1_ratio': None
# 'penalty': 'l2'
# 'solver': 'lbfgs'

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_rl = time.time() - t0
accuracy_rl = accuracy_score(y_test, y_pred)
roc_auc_rl = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_rl))
print("ROC Area bajo la Curva = {}".format(roc_auc_rl))
print("Tiempo de Ejecución = {}".format(time_taken_rl))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_rl, tpr_rl, thresholds = roc_curve(y_test, y_probs)
precision_rl, recall_rl, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.8963424864381186
ROC Area bajo la Curva = 0.9626823127261552
Tiempo de Ejecución = 0.8702704906463623
precision recall f1-score support
0 0.92747 0.88937 0.90802 20574
1 0.85804 0.90578 0.88127 15188
accuracy 0.89634 35762
macro avg 0.89275 0.89758 0.89464 35762
weighted avg 0.89798 0.89634 0.89666 35762
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:1479: UserWarning: l1_ratio parameter is only used when penalty is 'elasticnet'. Got (penalty=l1)
"(penalty={})".format(self.penalty)
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:1479: UserWarning: l1_ratio parameter is only used when penalty is 'elasticnet'. Got (penalty=l2)
"(penalty={})".format(self.penalty)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning:
63 fits failed out of a total of 144.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures (repeated tracebacks condensed to their final errors):
12 fits failed with: ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.
12 fits failed with: ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
12 fits failed with: ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got elasticnet penalty.
12 fits failed with: ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.
12 fits failed with: ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.
3 fits failed with: ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)
warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py:972: UserWarning: One or more of the test scores are non-finite: [ nan nan 0.89534048 0.89538841 0.89535246 0.89535246
0.89535246 0.89535246 nan nan nan nan
nan nan 0.89534048 0.89538841 0.89535246 0.89535246
0.89535246 0.89535246 nan nan nan 0.89535246
nan nan 0.89534048 0.89538841 0.89535246 0.89535246
0.89535246 0.89535246 nan nan nan 0.89537643
nan nan 0.89534048 0.89538841 0.89535246 0.89535246
0.89535246 0.89535246 nan nan nan 0.89538841]
category=UserWarning,
# best 5 estimators
grid_scores = pd.DataFrame(grid_search.cv_results_)
grid_scores[['rank_test_score', 'mean_test_score', 'std_test_score', 'param_max_depth',
             'param_learning_rate', 'param_n_estimators', 'mean_fit_time', 'std_fit_time']].sort_values('rank_test_score').head()
    rank_test_score  mean_test_score  std_test_score  param_max_depth  param_learning_rate  param_n_estimators  mean_fit_time  std_fit_time
14                1            0.963           0.000               20                0.100                 200          3.721         1.332
2                 2            0.963           0.000               -1                0.100                 200          6.366         2.064
11                3            0.962           0.000               15                0.100                 200          1.910         0.007
8                 4            0.962           0.000               10                0.100                 200          3.101         1.287
7                 5            0.962           0.001               10                0.100                 100          2.757         0.155
Within the top 5 we see that, over the range of n_estimators analyzed, the search preferred 200 over 100, with different max_depth levels and a learning rate of 0.1.
We choose the estimator based on the GridSearch results.
Code
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42, max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken_lgbm_best = time.time() - t0
accuracy_lgbm_best = accuracy_score(y_test, y_pred)
roc_auc_lgbm_best = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy_lgbm_best))
print("ROC Area bajo la Curva = {}".format(roc_auc_lgbm_best))
print("Tiempo de Ejecución = {}".format(time_taken_lgbm_best))
print(classification_report(y_test, y_pred, digits=5))
# store ROC curve and precision-recall curve values
fpr_lgbm_best, tpr_lgbm_best, thresholds = roc_curve(y_test, y_probs)
precision_lgbm_best, recall_lgbm_best, thresh = precision_recall_curve(y_test, y_probs)
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9653822493149152
ROC Area bajo la Curva = 0.9954933006592798
Tiempo de Ejecución = 3.290117025375366
precision recall f1-score support
0 0.96325 0.97711 0.97013 20574
1 0.96837 0.94950 0.95884 15188
accuracy 0.96538 35762
macro avg 0.96581 0.96330 0.96449 35762
weighted avg 0.96542 0.96538 0.96534 35762
Code
# feature importance by the number of times each feature was used in the model
pd.DataFrame({'Value': model.feature_importances_, 'Feature': X.columns.values}).sort_values(by="Value").plot(x='Feature', kind='barh', color='g')
plt.title('LightGBM - Feature Importance')
plt.show()
Although cross-validation was already used when selecting models with GridSearch and RandomizedSearch, we apply it again here to compare the performance of the four best models.
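A minimal sketch of that kind of cross-validated comparison, on synthetic stand-in data (the model set and names like `X_demo` are illustrative, not the exact candidates compared in the report):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    'decision_tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'logistic_regression': LogisticRegression(max_iter=1000, random_state=42),
}
# 5-fold cross-validated accuracy for each candidate model
cv_means = {name: cross_val_score(m, X_demo, y_demo, cv=5, scoring='accuracy').mean()
            for name, m in models.items()}
print(cv_means)
```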
From the univariate analysis we can say:
* Satisfaction (our target variable): in 56% of cases the opinion was neutral or negative, versus 44% of satisfied clients, so the target variable is not significantly imbalanced.
* Gender: the proportion of men and women surveyed is similar.
* According to the age histogram, most passengers are concentrated in the 25-60 age range.
* Classes: fewer than 10% of passengers travel in Eco Plus; the rest split almost evenly between Business and Eco. It is likely that Eco Plus is simply not offered on many flights, rather than this being a client preference.
* From the pie charts of the survey responses, passengers are more demanding (or less satisfied) with in-flight wifi and with the ease of online booking.
On the bivariate analysis:
* As the box plot shows, Business clients have a higher average age (roughly 33 to 50) than those travelling in the other two classes (roughly 25 to 50).
* The violin plot (distance/satisfaction) shows that clients on short trips (under 1000 km) tend to report neutral or negative satisfaction. This dissatisfaction may be driven by another variable associated with short trips rather than by distance itself (for example class, age, or type of travel); it is worth examining distance/satisfaction against those variables in multivariate plots.
The table following that plot shows that most Eco-class passengers are not satisfied, while most Business-class passengers are satisfied. A later violin plot (class/distance/satisfaction) also shows that short trips are concentrated in Eco class, which explains part of the lower satisfaction on short trips.
Along the same lines, the following tables show that passengers travelling for personal reasons are mostly not satisfied, and that they fly shorter distances than those travelling for work.
Multivariate observations:
* Departure delay has a very strong positive correlation with arrival delay (r = 0.96); however, it does not affect passenger satisfaction.
* A disloyal client is more likely to be dissatisfied.
* The violin plot (gender/age/satisfaction) shows that dissatisfied passengers are mostly 20 to 40 years old, while satisfied ones are mostly 40 to 60.
* From the heat map (satisfaction/class/surveys) we see that satisfied Business passengers valued the seats and the on-board service, while satisfied Economy passengers valued wifi and entertainment (useful information for the airline on what to strengthen in each class). The same map shows that dissatisfied passengers in every class gave low average scores to the in-flight wifi service.
According to our correlation plot, the variables with the strongest linear correlation with passenger satisfaction are:
Online Boarding
Class
Type of Travel
while those with the weakest linear correlation with satisfaction are:
Gate Location
Gender
Departure/Arrival Time Convenient
The FacetGrid plot (age vs distance by class) reveals zones and bands of class and age. This would make it possible to cluster and segment patterns in order to rethink market strategies.
Comments on the Models
The base models already perform well; hypertuning yields no significant improvements (e.g. an accuracy gain of 0.005 for the Decision Tree).
Logistic regression did not perform well in this case (89% accuracy) compared with our simplest model, the Decision Tree (95.3% accuracy).
The two best-performing models were the ensemble models: bagging (Random Forest) with 96.3% and boosting (LightGBM) with 96.5%.
The model can make two kinds of prediction error in our case: predicting that a client is not satisfied when they actually are, or predicting that a client is satisfied when they actually are not. If we were more averse to the latter error, we would choose the model that minimizes it (Random Forest 1.2%, LightGBM 1.3%, Decision Tree 2.4%, Logistic Regression 6.4%).
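That second error rate corresponds to the false-positive cell of the (normalized) confusion matrix. A minimal sketch of how to read it off, using toy labels rather than the real `y_test`/`y_pred`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels (1 = satisfied, 0 = neutral/dissatisfied); stand-ins for y_test / y_pred
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# for binary labels [0, 1], ravel() returns (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
# share of all predictions where the model says "satisfied" but the client is not
false_satisfied_rate = fp / (tn + fp + fn + tp)
print(false_satisfied_rate)  # 0.1 for these toy labels
```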
Choosing strictly by the Accuracy and Area Under the Curve metrics, the best classifier is LightGBM, with 96.5% and 99.5% respectively.
Model Improvements
PCA
We tried applying PCA to the features.
Code
# use the normalized features X_normalized
# look at the ratio (explained variance vs number of components)
pca = PCA()
pca.fit(X_normalized)
# explained variance ratio
exp_var_ratio = pca.explained_variance_ratio_
# cumulative sum of the variance ratios
cumsum = np.cumsum(exp_var_ratio)
# plot
plt.plot(cumsum)
plt.xlabel('Número de componentes')
plt.ylabel('Varianza explicada acumulada')
plt.grid()
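The cumulative explained-variance curve can also be turned into a concrete component count. A sketch of one common rule (smallest number of components covering 95% of the variance), on synthetic stand-in data since `X_normalized` is defined elsewhere:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the normalized feature matrix
X_demo, _ = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=42)
X_std = StandardScaler().fit_transform(X_demo)

pca = PCA().fit(X_std)
cumsum = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components that explains at least 95% of the variance
n_components = int(np.argmax(cumsum >= 0.95)) + 1
print(n_components)
```

Note that `PCA(n_components=0.95)` applies this rule directly.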
# keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)
Code
# use the best LightGBM to see if there is any improvement
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42, max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken = time.time() - t0
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy))
print("ROC Area bajo la Curva = {}".format(roc_auc))
print("Tiempo de Ejecución = {}".format(time_taken))
print(classification_report(y_test, y_pred, digits=5))
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9053184944913596
ROC Area bajo la Curva = 0.9684488739159265
Tiempo de Ejecución = 1.9643268585205078
precision recall f1-score support
0 0.92268 0.91183 0.91722 20574
1 0.88244 0.89650 0.88941 15188
accuracy 0.90532 35762
macro avg 0.90256 0.90416 0.90332 35762
weighted avg 0.90559 0.90532 0.90541 35762
# keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)
Code
# use the best LightGBM to see if there is any improvement
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42, max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# predictions
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store metrics
time_taken = time.time() - t0
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy))
print("ROC Area bajo la Curva = {}".format(roc_auc))
print("Tiempo de Ejecución = {}".format(time_taken))
print(classification_report(y_test, y_pred, digits=5))
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9322744812929925
ROC Area Under the Curve = 0.9827689036785423
Execution Time = 5.976301908493042
              precision    recall  f1-score   support

           0    0.94412   0.93779   0.94094     20574
           1    0.91648   0.92481   0.92063     15188

    accuracy                        0.93227     35762
   macro avg    0.93030   0.93130   0.93078     35762
weighted avg    0.93238   0.93227   0.93231     35762
Feature selection
We tried feature selection with three different techniques: Variance Threshold, SelectKBest, and SelectFromModel.
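The cells that fit the first two selectors are not included in this excerpt; the following is a minimal, self-contained sketch of how VarianceThreshold and SelectKBest are typically applied. The toy data, the 0.01 variance cutoff, and `k=10` are illustrative assumptions, not the values used in the project:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

rng = np.random.default_rng(42)
X_demo = rng.random((100, 22))    # stand-in for the normalized feature matrix
X_demo[:, 0] *= 0.05              # shrink one column so its variance is tiny
y_demo = rng.integers(0, 2, 100)  # stand-in binary target

# VarianceThreshold: drop features whose variance falls below the cutoff
vt = VarianceThreshold(threshold=0.01)
X_vt = vt.fit_transform(X_demo)

# SelectKBest: keep the 10 features with the highest ANOVA F-score
skb = SelectKBest(score_func=f_classif, k=10)
X_kb = skb.fit_transform(X_demo, y_demo)

print(X_demo.shape, X_vt.shape, X_kb.shape)
```

Note that VarianceThreshold is unsupervised (it ignores `y`), while SelectKBest scores each feature against the target.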
print(f'Original dataset shape: {X.shape}\nTransformed dataset shape: {X_transformed.shape}')
Original dataset shape: (119204, 22)
Transformed dataset shape: (119204, 19)
# keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=42)
# try LightGBM, the best model so far, to see if it improves
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42,
                           max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# prediction
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store values
time_taken = time.time() - t0
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy))
print("ROC Area Under the Curve = {}".format(roc_auc))
print("Execution Time = {}".format(time_taken))
print(classification_report(y_test, y_pred, digits=5))
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9524075834684861
ROC Area Under the Curve = 0.9925122547541857
Execution Time = 3.6733322143554688
              precision    recall  f1-score   support

           0    0.95797   0.95937   0.95867     20574
           1    0.94485   0.94298   0.94391     15188

    accuracy                        0.95241     35762
   macro avg    0.95141   0.95117   0.95129     35762
weighted avg    0.95240   0.95241   0.95240     35762
print(f'Original dataset shape: {X.shape}\nTransformed dataset shape: {X_transformed.shape}')
Original dataset shape: (119204, 22)
Transformed dataset shape: (119204, 10)
# keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=42)
# try LightGBM, the best model so far, to see if it improves
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42,
                           max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# prediction
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store values
time_taken = time.time() - t0
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy))
print("ROC Area Under the Curve = {}".format(roc_auc))
print("Execution Time = {}".format(time_taken))
print(classification_report(y_test, y_pred, digits=5))
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9363290643700016
ROC Area Under the Curve = 0.9857384031675174
Execution Time = 3.7593648433685303
              precision    recall  f1-score   support

           0    0.94651   0.94260   0.94455     20574
           1    0.92267   0.92784   0.92525     15188

    accuracy                        0.93633     35762
   macro avg    0.93459   0.93522   0.93490     35762
weighted avg    0.93639   0.93633   0.93635     35762
Select From Model
# use the best model, LightGBM, as the base estimator
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42,
                           max_depth=20, n_estimators=200, learning_rate=0.1)
selector = SelectFromModel(estimator=model).fit(X_normalized, y)
X_transformed = selector.transform(X_normalized)
print(f'Original dataset shape: {X.shape}\nTransformed dataset shape: {X_transformed.shape}')
Original dataset shape: (119204, 22)
Transformed dataset shape: (119204, 10)
# keep 30% for test and 70% for train
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=42)
Note that with the SelectFromModel method we go from 22 features down to a dataset with 10 features, while Accuracy drops only from 0.9653 to 0.9555 and AUC from 0.9954 to 0.9925.
# try LightGBM, the best model so far, to see if it improves
model = lgb.LGBMClassifier(class_weight='balanced', random_state=42,
                           max_depth=20, n_estimators=200, learning_rate=0.1)

########## Fit - Predict - Scores
t0 = time.time()
# fit the model
model.fit(X_train, y_train.ravel())
# prediction
y_pred = model.predict(X_test)
# predicted probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# store values
time_taken = time.time() - t0
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)
print("Accuracy = {}".format(accuracy))
print("ROC Area Under the Curve = {}".format(roc_auc))
print("Execution Time = {}".format(time_taken))
print(classification_report(y_test, y_pred, digits=5))
# plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='Blues', normalize='all')
# plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_probs)
plt.show()
Accuracy = 0.9555393993624518
ROC Area Under the Curve = 0.9925773041519812
Execution Time = 4.857831716537476
              precision    recall  f1-score   support

           0    0.95403   0.96943   0.96167     20574
           1    0.95766   0.93673   0.94708     15188

    accuracy                        0.95554     35762
   macro avg    0.95585   0.95308   0.95437     35762
weighted avg    0.95557   0.95554   0.95547     35762
Other tests performed
We balanced the target by transforming the dataset to 50% (satisfied) and 50% (neutral/dissatisfied), reducing the number of records through random sampling without replacement; however, in a test with the best LightGBM model we observed no performance improvement.
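A minimal sketch of the undersampling described above, assuming a pandas DataFrame with a binary `satisfaction` column (the column name and toy data are illustrative):

```python
import pandas as pd

# toy frame: 60 neutral/dissatisfied (0) vs 40 satisfied (1)
df = pd.DataFrame({'satisfaction': [0] * 60 + [1] * 40, 'x': range(100)})

majority = df[df['satisfaction'] == 0]
minority = df[df['satisfaction'] == 1]

# downsample the majority class to the minority size, without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
# concatenate and shuffle the balanced 50/50 dataset
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

print(balanced['satisfaction'].value_counts())
```

Undersampling discards majority-class information, which is one plausible reason it did not help here given that `class_weight='balanced'` already compensates for the imbalance.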
The variables "Age" and "Flight Distance" were transformed:
Age was converted into ranges: <18 minor, 18-30 young, 30-45 young-adult, 45-60 adult, and >60 senior, and then encoded as dummy variables.
Flight Distance was converted from kilometers to hours using an average speed in km/h, and the trip duration was then bucketed: short flight <= 3 h, medium flight 3-8 h, and long flight > 8 h, also encoded as dummies.
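These two transformations can be sketched as follows. The bin edges match the text, while the assumed average speed of 800 km/h, the toy data, and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Age': [15, 25, 40, 50, 70],
                   'Flight Distance': [500, 2000, 4000, 7000, 9000]})

# Age ranges: <18, 18-30, 30-45, 45-60, >60
df['Age Group'] = pd.cut(df['Age'], bins=[0, 18, 30, 45, 60, 120],
                         labels=['minor', 'young', 'young-adult', 'adult', 'senior'])

# convert distance to hours (assumed ~800 km/h average speed), then bucket
# into short (<= 3 h), medium (3-8 h) and long (> 8 h) flights
hours = df['Flight Distance'] / 800
df['Flight Length'] = pd.cut(hours, bins=[0, 3, 8, float('inf')],
                             labels=['short', 'medium', 'long'])

# one-hot encode (dummies), as described in the text
dummies = pd.get_dummies(df[['Age Group', 'Flight Length']])
print(dummies.columns.tolist())
```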
A test was run with the best LightGBM model applying these transformations (both together and each one separately), and no performance improvement was observed.
Originally we tried mapping the null/NA survey answers to 0; in the final model presented in this project those records were removed instead, which slightly improved performance.
Several ensemble algorithms were also tried, but none improved the classification, so we kept LightGBM, which gave the best results.
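As a rough illustration of that kind of comparison, the sketch below loops over two scikit-learn ensembles on toy data and records their accuracy (the models and data here are illustrative, not the original experiment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the survey data
X_toy, y_toy = make_classification(n_samples=1000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# fit each candidate ensemble and score it on the held-out split
results = {}
for name, clf in [('RandomForest', RandomForestClassifier(random_state=42)),
                  ('GradientBoosting', GradientBoostingClassifier(random_state=42))]:
    clf.fit(Xtr, ytr)
    results[name] = accuracy_score(yte, clf.predict(Xte))

print(results)
```

In the project, LightGBM came out ahead in this kind of head-to-head comparison.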
Future work
Throughout the project we found it difficult to improve both Accuracy and AUC; several methods were tried without success, and we could not push Accuracy above 0.965 or AUC above 0.995. We believe a more exhaustive feature-engineering analysis might break that barrier.